Abstract

The celebrated multi-armed bandit problem in decision theory models the central trade-off 
between exploration, or learning about the state of a system, and exploitation, or utilizing 
the system. In this paper we study the variant of the multi-armed bandit problem where the 
exploration phase involves costly experiments and occurs before the exploitation phase; and 
where each play of an arm during the exploration phase updates a prior belief about the arm. 
The problem of finding an inexpensive exploration strategy to optimize a certain exploitation 
objective is NP-Hard even when a single play reveals all information about an arm, and all 
exploration steps cost the same. 

We provide the first polynomial time constant-factor approximation algorithm for this class 
of problems. We show that this framework also generalizes several problems of interest studied 
in the context of data acquisition in sensor networks. Our analysis also extends to switching
and setup costs, and to concave utility objectives.

Our solution approach is via a novel linear program rounding technique based on stochastic 
packing. In addition to yielding exploration policies whose performance is within a small constant
factor of the adaptive optimal policy, a nice feature of this approach is that the resulting
policies explore the arms sequentially without revisiting any arm. Sequentiality is a well-studied 
paradigm in decision theory, and is very desirable in domains where multiple explorations can 
be conducted in parallel, for instance, in the sensor network context. 

1 Introduction 

The sequential design of experiments is a classic problem first formulated by Wald in 1947 [49].
The study of this problem gave rise to the general field of decision theory; and more specifically, 
led Robbins [4T] to formulate the celebrated multi-armed bandit problem, and Snell [46] and Rob- 
bins [H] to invent the theory of optimal stopping. The copious literature in this field is surveyed 
by Whittle [SUES]. 

The canonical problem of sequential design of experiments is best described in the language of 
the multi-armed bandit problem: There are n competing options referred to as "arms" (for instance, 
consider clinical treatments) yielding unknown rewards (or having unknown effectiveness) $\{p_i\}$.
Playing an arm (or testing a treatment on a patient) yields observations that reveal information 
about the underlying reward or effectiveness. The goal is to sequentially test the treatments (or 
sequentially play the arms) in order to ultimately choose the "best" one. Such problems are usually 
studied in a decision theoretic setting, where costs and utilities are associated with actions (testing 
a treatment) and outcomes (choosing one treatment finally). The goal of any decision procedure is
to come up with a plan for testing the treatments (or playing the arms) and choosing an outcome in 
order to optimize some criterion based on the costs and utilities. The testing procedure is termed 
exploration, and choosing the outcome is termed exploitation. The crux of the multi-armed bandit 
problem, and the reason it has been extensively studied, is that it cleanly models the general trade-off
between the cost of exploration (or learning more about the state of the system) and the utility 
gained from exploitation (or utilizing the system). 

Various frameworks in decision theory differ in (i) the available information and (ii) optimization 
criteria for evaluating a decision plan. We now describe the problem we study from the perspective 
of these design choices. From the perspective of available information, we focus exclusively on the 
Bayesian setting, first formulated by Arrow, Blackwell and Girshick in 1949 [2]. In this setting, each
arm (or treatment) is associated with prior information (specified by distributions) that updates 
via Bayes' rule conditioned on the results of the plays (or tests). More formally, we are given a 
bandit with n independent arms. The set of possible states of arm i is denoted by $S_i$, and the initial
state is $\rho_i \in S_i$. When arm i is played in a state $u \in S_i$, the arm transitions to state $v \in S_i$ w.p.
$p_{uv}$, depending on the observed outcome of the play. The initial state models the prior knowledge
about the arm. The states in general capture the posterior conditioned on the observations from
a sequence of plays (or experiments) starting at the root. The cost of a play depends on whether
the previous play was for the same arm or not. If the previous play was for the same arm, the play
at $u \in S_i$ costs $c_u$; otherwise it costs $c_u + h_i$, where $h_i$ is the setup cost for switching into arm i. Recall
that the arms correspond to different treatments or experiments; therefore, this cost models setting
up the corresponding experiment. Every state $u \in S_i$ is associated with a reward $r_u$, which is the
expected reward of playing in this state (which is of course conditioned on the observations from
the plays so far). By Bayes' rule, the rewards of the different states evolve according to a martingale
property: $r_u = \sum_{v \in S_i} p_{uv} r_v$. We present concrete examples of state spaces in Section 2.

From the optimization perspective, our objective is to maximize future utilization. Any policy 
explores (or tests) the arms for a certain amount of time and subsequently, exploits (or chooses) an 
arm that yields the best expected posterior (or future) reward. For this objective to be meaningful, 
we need to constrain the total cost we can incur in exploration before making the exploit decision. 
A natural example of this is product marketing research, where the entire exploration phase appears 
before the exploitation phase. Formally, a policy $\pi$ performs a possibly adaptive sequence of plays
during the exploration. Since the state evolutions are stochastic, the exploration phase leads to a
probability distribution over outcomes, $O(\pi)$. In outcome $o \in O(\pi)$, each arm i is in some final
state $u_i^o$. In this outcome o the policy will choose the "best arm" $\max_i r_{u_i^o}$ (or a suitable concave
function of the vector $(\ldots, r_{u_i^o}, \ldots)$). The expected reward of the policy $\pi$ over the outcomes of
exploration, $R(\pi)$, is $\sum_{o \in O(\pi)} q(o, \pi) \max_i r_{u_i^o}$, where $q(o, \pi)$ is the probability of outcome o.
Let $C(o, \pi)$ denote the cost of the exploration plays
made by the policy given an outcome o. In the simplest version, we seek to find the policy $\pi$
which maximizes $R(\pi)$ subject to $C(o, \pi) \le C$ for all $o \in O(\pi)$. As remarked in [2], this problem
is solvable by dynamic programming [11, 13]. However, this approach requires computation time
polynomial in the joint state space (truncated by the budget constraint) for multiple arms, which 
is the product of the individual (truncated) state spaces. Unsurprisingly, the problem becomes 
NP-Hard even when a single play reveals the full information about an arm, and all plays (across 
different arms) cost the same [27]. Designing a policy which is computationally tractable, at the
cost of bounded loss in performance, is the main goal of this paper. We will study the problem 
from the perspective of approximation algorithms, where we seek to find a provably near optimal 

¹Our algorithms also extend to concave costs, where the cost of r consecutive plays is concave in r, as well as
to switching-out costs; we omit that discussion here.
solution with the restriction that the algorithm must run in time polynomial in the sum of the state
spaces. More precisely, we seek an algorithm that yields a utilization of at least $OPT/\alpha$, where
$OPT = \max_\pi R(\pi)$ subject to $C(o, \pi) \le C$ for all $o \in O(\pi)$; this is called an $\alpha$-approximation.
Note that we seek a multiplicative approximation because such a result is invariant under scaling of
the rewards (see also the discussion on discounted rewards below). Since it is NP-Hard to determine
OPT, we use a linear program to determine an upper bound $\gamma^* \ge OPT$ and provide
an algorithm that achieves $\gamma^*/\alpha$ in the worst case. An added benefit of this approach is
that we obtain a concrete upper bound $\gamma^*$ for comparison: an algorithm that guarantees $\gamma^*/\alpha$
in the worst case may have significantly better (and quantifiable, due to the existence of the
upper bound) performance in practice. The interested reader may consult [48] for a review of
approximation algorithms. 

The need to study this problem is heightened by the emergence of several applications
where the number of arms is large, typically data-intensive applications. Examples of this
problem arise in "active learning" |381 [i2] where the goal is to learn and choose the most discerning 
hypothesis by sequentially testing the hypotheses on a set of assisted examples; sensor networks [35],
where the goal is sensor placement to maximize a utility function such as information gain, based 
on sequentially collecting a small number of samples; and databases [7], where the goal is to settle 
upon a possibly long running query execution plan, again based on a few carefully chosen samples. 

1.1 Related Models 

The future utilization objective is well known in the literature (see, for instance, Berry and
Fristedt [12], Chapter 3.6). The unit cost version of this problem is a special case of the infinite horizon
discounted multi-armed bandit problem. In the discounted bandit problem, there is an infinite
discount sequence $\{\alpha_t \in [0, 1] \mid t = 1, 2, \ldots\}$. Any policy $\pi$ plays an arm at each time step; suppose
the expected reward from playing at time t is $R_t(\pi)$. The goal is to design an adaptive policy $\pi$ to
maximize $\sum_{t \ge 1} \alpha_t R_t(\pi)$. The future utilization objective with an exploration budget C corresponds
to $\alpha_1 = \alpha_2 = \cdots = \alpha_C = \alpha_{C+2} = \alpha_{C+3} = \cdots = 0$ and $\alpha_{C+1} = 1$. This setting implies that the objective
is the reward of the arm chosen at the $(C+1)$-st play (exploitation), and that the only plays of significance
for making this choice are the first C plays (exploration). As observed in [12], this problem seems
significantly harder computationally than the case where the discount sequence is monotonically
decreasing with time. In fact, when the discount sequence is geometric, i.e., $\alpha_t = \beta^t$ for some
$\beta < 1$, the celebrated result of Gittins and Jones shows that there exists an elegant greedy optimal
solution termed the Gittins index policy [26]; an index policy ranks the arms based solely on their
own characteristics and plays the best arm at every step. The Gittins index is suboptimal both in
the finite horizon setting, where $\alpha_t = 1$ for $t \le C$ and 0 otherwise, and in the future utilization
setting we consider here [38]. Finally, Banks and Sundaram [10] show that no index exists in the
presence of switching in/out costs.
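
The discount sequences compared above can be written down directly. The following is a minimal sketch, not part of the original analysis: the function names and the reward values are hypothetical, and the infinite sequence is truncated to a finite horizon purely for illustration.

```python
def discounted_value(alphas, rewards):
    """Return sum_t alpha_t * R_t over the finite prefix supplied."""
    return sum(a * r for a, r in zip(alphas, rewards))


def future_utilization(budget, horizon):
    """alpha_t = 1 only at t = budget + 1 (the exploitation step), else 0."""
    return [1.0 if t == budget + 1 else 0.0 for t in range(1, horizon + 1)]


def finite_horizon(budget, horizon):
    """alpha_t = 1 for t <= budget and 0 otherwise."""
    return [1.0 if t <= budget else 0.0 for t in range(1, horizon + 1)]


def geometric(beta, horizon):
    """alpha_t = beta ** t for some beta < 1 (the Gittins setting)."""
    return [beta ** t for t in range(1, horizon + 1)]


# Hypothetical expected rewards R_t(pi) of some fixed policy at t = 1..5.
rewards = [0.2, 0.5, 0.1, 0.9, 0.3]
budget = 3

# Only the reward of the (budget + 1)-st play counts.
print(discounted_value(future_utilization(budget, 5), rewards))
# Rewards of the first `budget` plays count equally.
print(discounted_value(finite_horizon(budget, 5), rewards))
```

Under the future utilization sequence the first C rewards contribute nothing to the objective, which is why errors made while exploring are irrelevant in this setting.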

Alternatives to the Bayesian formulation are also as old as the original study of Wald [49] and
Robbins [H]. These versions do not assume prior information, but instead perform a min-max 
optimization over possible underlying rewards via a suitably constructed loss or regret measure. As 
observed in [12, 50], although minmax objectives are more robust, the Bayesian approach is more
widely used since it typically requires fewer samples. Furthermore, the regret criterion naturally
forces the optimization to consider the past: what is the minimum loss in the past N trials due to
not knowing the true rewards? Note that minimizing regret is not the same as maximizing future
utilization, the former being more akin to the finite horizon version with discount sequence $\alpha_t = 1$
for $t \le C$ and 0 otherwise. Intuitively, in the former, we attempt to minimize the error during the
testing process, while in the latter, we do not care about errors in testing, but attempt to ensure
that at the end, we are truly picking the (near) best option for exploitation.

Nevertheless, it is natural to ask whether the algorithms suggested in the context of minmax 
analysis, particularly the seminal works of Lai and Robbins [36], and Auer, Cesa-Bianchi and
Fischer [4] (and extended to uniform switching costs in [47, 3]), have good performance guarantees
in the future utilization measure. However these are "model free" algorithms, and it is easy to show 
that for appropriately chosen budget C, these algorithms have significantly inferior performance 
on the future utilization objective as compared to algorithms that use the prior information. This 
is not surprising because the objectives are different. Similar comments apply to the "experts" 
problem [18] and subsequent research in adversarial multiarmed bandits O [25] where the reward 
distribution is chosen by an adversary and need not be stochastic. 

It is worth pointing out that in the loss function or minmax approach, the loss or regret arises 
due to lack of information about the rewards. The difficulty in optimizing future utilization in 
the Bayesian setting arises from the computational aspect. This is quite similar to the differences 
between the classes of online and approximation algorithms. 

1.2 Structure of the Policies 

For the future utilization measure, it is worth mentioning that the general structure of the policies
is important. Two classes of policies are noteworthy. The first class is motivated by the
stopping time problem, an early example of which is the secretary problem [20]. A policy in this
class fixes an ordering of the arms in advance and samples the arms sequentially, i.e., it does not
return to a previously rejected arm. The benefit of such strategies is that they are often succinct
to represent and easy to implement in real hardware from the perspective of control. Another
benefit, as the reader may have observed, is that it is easy to model switching/setup costs in
such policies; these costs can in fact be generalized so that r consecutive plays have a cost that
is a concave function of r. We call such policies sequential, because the ordering of the arms
is fixed beforehand. Such strategies have been considered in testing between two hypotheses [49],
stochastic scheduling [391 SS] j stochastic packing [231 121] and in operator placement in databases 
[HI [9] - however all except the hypotheses testing results hold for two-level state spaces (or arms 
with point priors), where a single play reveals complete information about the underlying reward 
of the arm. (See Section 2 for a formal definition.)

The second and more restrictive class of policies performs all the tests (or plays) before observing 
any of their outcomes. Therefore, the policy has three disjoint successive phases: Test, observe, and 
select. Such non-adaptive policies are of interest when the observations can be made in parallel, 
and therefore the final choice can be made quicker. Naturally these strategies are meaningful for 
two-level state spaces, and have thus been found to be of interest in the context of sensor networks [35],
multihoming networks [1], stochastic optimization [27, 30] and database optimization [7].

For both the above classes, the goal is to show that the performance of an algorithm restricted
to the respective class is not significantly worse than that of an adversary whose strategy is fully
adaptive. This is known as the adaptivity gap of a strategy. All previous analyses of the adaptivity gap
were restricted to two-level state spaces. This paper provides a uniform framework that extends
to both the classes above and applies to multi-level state spaces. It is interesting to note that
one of the original goals of Wald [49] in sequential analysis was to explore sequential strategies.
Though such strategies are optimal for choosing between two hypotheses, the difficulty in obtaining
optimal strategies for testing multiple competing hypotheses has been known since that time. The
major contribution of this work is to show that in a variety of bandit settings, when we are seeking
to optimize any concave function of the posterior probabilities, the adaptivity gap in considering
sequential strategies is bounded by a constant. In other words, the performance of a fully adaptive
solution cannot be significantly better than that of a sequential strategy.

1.3 Problems and Results 

We consider three main types of problems in this paper. Recall that there are n independent arms,
each with its own state space $S_i$; a policy $\pi$ adaptively explores the arms, paying expected cost $C(\pi)$,
before selecting an arm for exploitation based on the observed outcomes. The expected reward of
the selected arm over the outcomes of the policy $\pi$ is denoted $R(\pi)$.

• Budgeted (Futuristic) Bandits: There is a cost budget C. A policy $\pi$ is feasible if for any
sequence of plays made by the policy, the cost is at most C. The goal is to find the feasible
policy $\pi$ with maximum $R(\pi)$. We have already discussed switching costs. An extension of
switching cost is concave play cost, where the cost of consecutive uninterrupted plays of an arm
is concave in the number of plays. This was first hinted at in [2], although the authors explicitly
settled on linear costs.

A generalization of the above problem is the budgeted concave utility bandits problem, where
the objective function is an arbitrary concave function of the final rewards of the arms.
Examples of such functions include choosing the best K arms, power allocation across noisy
channels [21], and optimizing "TCP friendly" network utility functions [37].

• Model Driven Optimization: This is a non-adaptive formulation of the above, where the
state space $S_i$ is 2-level and a single play reveals full information about an arm. In such a
context, non-adaptive strategies are desirable since the plays can be executed in parallel. A
feasible non-adaptive policy $\pi$ chooses a subset of the arms to explore, before seeing the result
of any of the plays. There have been a significant number of papers on this setting in recent years,
especially in the context of sensor networks. Our paper unifies this thread with the bandit framework.

• Lagrangean (Futuristic) Bandits: Find the policy $\pi$ with maximum $R(\pi) - C(\pi)$. Note
that the Lagrangean can be defined in both the adaptive and non-adaptive settings. This is
a natural extension of the single-arm optimal stopping time problem.

In this paper, we present a single framework that provides efficient algorithms yielding policies 
with near-optimal performance for all of the above problems. For the budgeted (futuristic) bandits 
in the concave cost setting (including switching in/out cost), we show that there exists a sequential 
strategy that respects the budget, and has objective value at most a factor 4 away from that of the 
optimal fully-adaptive strategy subjected to the same budget. Section 2 discusses different state
spaces. Section 3 presents the approximate sequential strategy that respects the
budget, for linear utilities (objective function). We also present a bicriteria $2(1+\alpha)$ approximation
with the cost constraint relaxed by a factor ^. In Section 4 we show how the same framework gives
a more restricted non-adaptive strategy for 2-level state spaces which is within a constant factor of
the best adaptive strategy. In contrast, for multi-level state spaces, any non-adaptive strategy has
a significant performance loss. We also present a sequential strategy that is a 2-approximation for
the Lagrangean Bandits in Section 5. In Section 6 we extend the results in Section 3 to concave
utilities with a factor 2 loss in the approximation factor.

Note that constant factor approximations are best possible from the context of adaptivity gap 
of sequential policies as well as integrality gap of the linear programming relaxations we use. 

Techniques: We use a linear programming formulation over the state space of individual arms, and
we achieve a formulation whose size is polynomial in the size of each individual state space. This particular
formulation has been used in the past [531 HO] and found to be useful in practice. To the best of
our knowledge, we present the first analysis of these relaxations in the finite horizon context. 

We also bring to bear techniques from the stochastic packing literature, particularly the work on
adaptivity gaps by Dean, Goemans and Vondrak [23, 24, 22]. Their results can be viewed as sequential
strategies for 2-level state spaces and are similar to the online nature of the policies considered in
stochastic scheduling [39, 45], where there is a strong notion of "irrevocable commitment". While
the online notion is related to sequential strategies, they are not the same.

In terms of analysis, our results can be thought of as extending analysis both to arbitrary state 
spaces as well as for non-adaptive strategies for the 2-level case. Our overall technique can be 
thought of as "LP rounding via stochastic packing" - finding this connection between finite horizon 
multi-armed bandits and stochastic packing by designing simple LP rounding policies for a very 
general class of budgeted bandit problems represents the key contribution of this work. 

Related Work: Several heuristics have been proposed for the budgeted (futuristic) bandit problem
by Schneider and Moore [42] and Madani et al. [38]. The final algorithm that arises from our
framework bears resemblance (but is not the same) to the algorithms proposed therein, but as far
as we are aware there was no prior analysis of any algorithm in this context. A series of papers
[27, 35, 30] considered the 2-level state spaces (where a single play resolves all information about
an arm) for specific problems and presented approximations. The Lagrangean (futuristic) bandit
problem with 2-level state space has been considered before in [1], where a 1.25 approximation is
presented. None of those techniques apply to the iterative refinement that is required for multi-level
state spaces. Note that most other literature on stochastic packing does not consider refinement
of information [33, 28].

Our LP relaxation is well-studied in the context of multi-armed bandit problems [15[ [551 US] 
and other loosely coupled systems such as multi-class queueing systems [Ml [17] ; we present the first 
provable analysis of this formulation. Though LP formulations over the state space of outcomes 
exist for other stochastic optimization problems such as multi-stage optimization with recourse [341 
[121 [19], these formulations are based on sampling scenarios. However these problems also do not 
have a notion of refinement, and are fundamentally different from our setting where the scenarios 
would be refinement trajectories [32] that are hard to sample. 

2 Types of State Spaces 

Recall that each arm is associated with a state that evolves when the arm is played. The state 
captures the distributional knowledge about the reward distribution of the arm. Formally, the set 
of possible states of arm i is denoted by $S_i$, and the initial state is $\rho_i \in S_i$. When arm i is played
in a state $u \in S_i$, the arm transitions to state $v \in S_i$ w.p. $p_{uv}$, depending on the observed outcome
of the play. The initial state models the prior knowledge about the arm. The states in general
capture the posterior conditioned on the observations from a sequence of plays (or experiments)
starting at the root. Every state $u \in S_i$ is associated with a reward $r_u$, which is the expected
reward of playing in this state (which is of course conditioned on the observations from the plays so
far). By Bayes' rule, the rewards of the different states evolve according to a martingale property:
$r_u = \sum_{v \in S_i} p_{uv} r_v$.

We now present two representative scenarios in order to better motivate the abstract problem 
formulation. In the first scenario, the underlying reward distribution is deterministic, and the 
distributional knowledge is specified as a distribution over the possible deterministic values; this 
implies that the uncertainty about an arm is completely resolved in one play by observing the 
reward. In the second scenario, the uncertainty resolves gradually over time. 
Two-level State Space. A two-level state space models the case where the underlying reward
of the arm is deterministic, so that the prior knowledge is a distribution over these values. In
this setting, a single play resolves this distribution into a deterministic posterior. Formally, the
prior distributional knowledge $X_i$ is a discrete distribution over values $\{a^i_1, a^i_2, \ldots, a^i_m\}$, so that
$\Pr[X_i = a^i_j] = p^i_j$ for $j = 1, 2, \ldots, m$. The state space $S_i$ of the arm is as follows: The root node
$\rho_i$ has $r_{\rho_i} = E[X_i] = \mu_i$. For $j = 1, 2, \ldots, m$, state $i_j$ has $r_{i_j} = a^i_j$, and $p_{\rho_i i_j} = p^i_j$. Since the
underlying reward distribution is simply a deterministic value, the state space is 2-level, defining a
star graph with $\rho_i$ being the root, and $i_1, i_2, \ldots, i_m$ being the leaves.

To motivate budgeted bandits in such state spaces, consider a sensor network where the root 
server monitors the maximum value [U Hlj . The probability distributions of the values at various 
nodes are known to the server via past observations. However, at the current step, probing all 
nodes to find out their actual values is undesirable since it requires transmissions from all nodes, 
consuming their battery life. Consider the simple setting where the network connecting the nodes 
to the server is a one-level tree, and probing a node consumes battery power of that node. Given a 
bound on the total battery life consumed, the goal of the root server is to maximize (in expectation) 
its estimate of the maximum value. Formally, each node corresponds to a distribution $X_i$ with mean
$\mu_i$; the exact value sensed at the node can be found by paying a "transmission cost" $c_i$. The goal of
the server is to adaptively probe a subset S of nodes with total transmission cost at most C in order
to maximize the estimate of the largest value sensed, i.e., maximize $E[\max(\max_{i \in S} X_i, \max_{i \notin S} \mu_i)]$,
where the expectation is over the adaptive choice of S and the outcome of the probes. The term
$\max_{i \notin S} \mu_i$ incorporates the mean of the unprobed nodes into the estimate of the maximum value.
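
To make the objective concrete, the following sketch evaluates it by brute-force enumeration for a fixed, non-adaptively chosen probe set S; it does not handle the adaptive choice of S. The function name, the node labels, and the distributions are hypothetical, and each node's prior is assumed to be a small discrete distribution given as (value, probability) pairs.

```python
from itertools import product


def expected_max_estimate(nodes, probed):
    """E[max(max_{i in S} X_i, max_{i not in S} mu_i)] for a fixed probe set S.

    nodes  : dict mapping node -> list of (value, probability) pairs
    probed : set of nodes in S (chosen before any outcome is seen)
    """
    means = {i: sum(v * p for v, p in dist) for i, dist in nodes.items()}
    # Unprobed nodes contribute only their prior means.
    baseline = max((means[i] for i in nodes if i not in probed),
                   default=float("-inf"))
    total = 0.0
    # Enumerate every joint outcome of the probed (independent) nodes.
    for outcome in product(*(nodes[i] for i in sorted(probed))):
        prob, best = 1.0, baseline
        for value, p in outcome:
            prob *= p
            best = max(best, value)
        total += prob * best
    return total


# Hypothetical instance: node "a" is uncertain, node "b" is known exactly.
nodes = {"a": [(0.0, 0.5), (10.0, 0.5)], "b": [(4.0, 1.0)]}
for subset in (set(), {"a"}, {"b"}):
    print(sorted(subset), expected_max_estimate(nodes, subset))
```

In this toy instance probing the uncertain node raises the estimate, while probing the node whose value is already known does not, which is the intuition behind spending the budget where the prior is least resolved.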

In this context, it is desirable for the sensor node to probe the nodes in parallel, i.e., use a 
non-adaptive strategy. The question then becomes how good is such a strategy compared to the 
optimal adaptive strategy. We show positive results for the context of 2-level spaces in Section 4.

Multi-level State Spaces. These are the most general state spaces we consider, and make sense in 
contexts such as clinical trials where the underlying effectiveness of a treatment is a random variable 
following a parametrized distribution with unknown parameters. The prior distribution will then 
be a distribution over possible parameter values. In the clinical trial setting, each experimental 
drug is a bandit arm, and the goal is to devise a clinical trial phase to maximize the belief about 
the effectiveness of the drug finally chosen for marketing. Each drug has an effectiveness that is 
unknown a priori. The effectiveness can be modeled as a coin whose bias, $\theta$, is unknown a priori;
the outcomes of tossing the coin (running a trial) are 0 and 1, which correspond to a trial being
ineffective and effective respectively. The uncertainty in the bias is specified by a prior distribution
(or belief) on the possible values it can take. Since the underlying distribution is Bernoulli, its
conjugate prior is the Beta distribution. A Beta distribution with parameters $\alpha_1, \alpha_2 \in \{1, 2, \ldots\}$,
which we denote $B(\alpha_1, \alpha_2)$, has p.d.f. of the form $c\,\theta^{\alpha_1 - 1}(1 - \theta)^{\alpha_2 - 1}$, where c is a normalizing
constant. $B(1, 1)$ is the uniform distribution, which corresponds to having no a priori information.
The distribution $B(\alpha_1, \alpha_2)$ corresponds to the current (posterior) distribution over the possible
values of the bias $\theta$ after having observed $(\alpha_1 - 1)$ 1's and $(\alpha_2 - 1)$ 0's. Given this distribution as
our belief, the expected value of the bias or effectiveness is $\frac{\alpha_1}{\alpha_1 + \alpha_2}$.

The state space $S_i$ is a DAG, whose root $\rho_i$ encodes the initial belief about the bias, $B(\alpha_1, \alpha_2)$, so
that $r_{\rho_i} = \frac{\alpha_1}{\alpha_1 + \alpha_2}$. When the arm is played in this state, the state evolves depending on the outcome
observed: if the outcome is 1, which happens w.p. $\frac{\alpha_1}{\alpha_1 + \alpha_2}$, the child u has belief $B(\alpha_1 + 1, \alpha_2)$, so
that $r_u = \frac{\alpha_1 + 1}{\alpha_1 + \alpha_2 + 1}$ and $p_{\rho_i u} = \frac{\alpha_1}{\alpha_1 + \alpha_2}$; if the outcome is 0, the child v has belief $B(\alpha_1, \alpha_2 + 1)$,
$r_v = \frac{\alpha_1}{\alpha_1 + \alpha_2 + 1}$, and $p_{\rho_i v} = \frac{\alpha_2}{\alpha_1 + \alpha_2}$. In general, if the DAG $S_i$ has depth C (corresponding to playing
the arm at most C times), it has $O(C^2)$ states. We omit details, since Beta distributions and their
multinomial generalizations, the Dirichlet distributions, are standard in the Bayesian context (refer
for instance to Wetherill and Glazebrook [50]).
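
The Beta state space is small enough to build explicitly. The sketch below is illustrative only: the function names are hypothetical, states are encoded as pairs $(\alpha_1, \alpha_2)$, and exact rational arithmetic is used so that the martingale property $r_u = \sum_v p_{uv} r_v$ can be checked without rounding error.

```python
from fractions import Fraction


def beta_state_space(a1, a2, depth):
    """Build the DAG of beliefs reachable from B(a1, a2) in at most `depth` plays.

    Returns (rewards, trans): rewards[s] is the posterior mean at state s,
    and trans[s] maps each child of s to its transition probability.
    """
    rewards, trans = {}, {}
    for plays in range(depth + 1):
        for ones in range(plays + 1):
            a, b = a1 + ones, a2 + plays - ones
            rewards[(a, b)] = Fraction(a, a + b)
            if plays < depth:  # states at the last level are never played
                trans[(a, b)] = {
                    (a + 1, b): Fraction(a, a + b),  # outcome 1 observed
                    (a, b + 1): Fraction(b, a + b),  # outcome 0 observed
                }
    return rewards, trans


def is_martingale(rewards, trans):
    """Check r_u == sum_v p_uv * r_v at every state that can be played."""
    return all(
        rewards[u] == sum(p * rewards[v] for v, p in children.items())
        for u, children in trans.items()
    )


rewards, trans = beta_state_space(1, 1, depth=3)
print(len(rewards), is_martingale(rewards, trans))
```

Level k of the DAG contains k + 1 distinct beliefs, so a depth-C space has (C + 1)(C + 2)/2 states, consistent with the $O(C^2)$ bound stated above.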



3 Budgeted Bandits 

We are given a bandit with n independent arms. The set of possible states of arm i is denoted by
$S_i$, and the initial state is $\rho_i \in S_i$. When arm i is played in a state $u \in S_i$, the arm transitions to
state $v \in S_i$ w.p. $p_{uv}$. The reward at a state satisfies $r_u = \sum_{v \in S_i} p_{uv} r_v$. The cost of a play depends
on whether the previous play was for the same arm or not. If the previous play was for the same
arm, the play at $u \in S_i$ costs $c_u$; otherwise it costs $c_u + h_i$, where $h_i$ is the setup cost for switching into
arm i. A policy $\pi$ performs a possibly adaptive sequence of plays during the exploration, leading
to a probability distribution over outcomes, $O(\pi)$. In outcome $o \in O(\pi)$, each arm i is in some
final state $u_i^o$. In this outcome o the policy chooses $\max_i r_{u_i^o}$. The expected reward of the policy $\pi$
over the outcomes of exploration, $R(\pi)$, is $\sum_{o \in O(\pi)} q(o, \pi) \max_i r_{u_i^o}$. Let $C(o, \pi)$ denote the cost of
the exploration plays made by the policy given an outcome o. In this section, we seek to find the
policy $\pi$ which maximizes $R(\pi)$ subject to $C(o, \pi) \le C$ for all $o \in O(\pi)$.

We describe the linear programming formulation and rounding technique that yields a
4-approximation. We note that the formulation and solution are polynomial in n, the number of
arms, and m, the number of states per arm. 

3.1 Linear Programming Formulation 

Recall the notation from Section 1.3. Consider any adaptive policy $\pi$. For some arm i and state
$u \in S_i$, let: (1) $w_u$ denote the probability that during the execution of the policy $\pi$, arm i enters
state $u \in S_i$; (2) $z_u$ denote the probability that the state of arm i is u and the policy plays arm i in
this state; and (3) $x_u$ denote the probability that the policy $\pi$ chooses the arm i in state u during
the exploitation phase. Note that since the latter two correspond to mutually exclusive events,
we have $x_u + z_u \le w_u$. The following LP has three variables $w_u$, $x_u$, and $z_u$ for each arm i
and each $u \in S_i$. A similar LP formulation was proposed for the multi-armed bandit problem by
Whittle [53] and Bertsimas and Nino-Mora.



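$$\text{Maximize} \quad \sum_{i=1}^{n} \sum_{u \in S_i} x_u r_u$$

$$\begin{array}{rcll}
\sum_{i=1}^{n} \sum_{u \in S_i} x_u & \le & 1 & \\
\sum_{i=1}^{n} \left( h_i z_{\rho_i} + \sum_{u \in S_i} c_u z_u \right) & \le & C & \\
\sum_{v \in S_i} z_v p_{vu} & = & w_u & \forall i, \; u \in S_i \setminus \{\rho_i\} \\
x_u + z_u & \le & w_u & \forall u \in S_i, \; \forall i \\
x_u, z_u, w_u & \in & [0, 1] & \forall u \in S_i, \; \forall i
\end{array}$$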

Let $\gamma^*$ be the optimal LP value, and OPT be the expected reward of the optimal adaptive policy.
Claim 3.1. $OPT \le \gamma^*$.

Proof. We show that the $w_u, z_u, x_u$ as defined above, corresponding to the optimal policy $\pi^*$, are
feasible for the constraints of the LP. Since each possible outcome of exploration leads to choosing
one arm $i$ in some state $u \in S_i$ for exploitation, in expectation over the outcomes, one arm in one
state is chosen for exploitation. This is captured by the first constraint. Further, since on each
sequence of outcomes (the decision trajectory), the cost of playing and switching into the arms is at
most $C$, over the entire decision tree, the expected cost of switching into the root states $\rho_i$ plus
the expected cost of play is at most $C$. This is captured by the second constraint. Note that the LP
only takes into account the cost of switching into an arm the very first time this arm is explored,
and ignores the rest of the switching costs. This is clearly a relaxation, since the optimal policy
might switch multiple times into any arm. However, our rounding procedure switches into an arm
at most once, preserving the structure of the LP relaxation.

The third constraint simply encodes that the probability of reaching a state $u \in S_i$ during
exploration is precisely the probability with which the arm is played in some state $v \in S_i$, times
the probability $p_{vu}$ that it reaches $u$ conditioned on that play, summed over all such states $v$.
The constraint $x_u + z_u \le w_u$ simply captures that playing an arm is a disjoint event from
exploiting it in any state. The objective is precisely the expected reward of the policy. Hence, the
LP is a relaxation of the optimal policy. □
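The constraints discussed in the proof can be checked mechanically for a candidate solution. The following is a minimal sketch, not part of the original analysis: the dictionary layout, the helper names `lp_feasible` and `lp_objective`, and the toy arm `TOY_ARM` are all assumptions made for illustration. It checks that at most one arm is exploited in expectation, that the expected switching and play cost is at most the budget, that reach probabilities match the plays that lead to them, and that playing and exploiting are disjoint.

```python
# Hypothetical toy arm (illustrative only): root "rho" with two successor states.
TOY_ARM = {
    "root": "rho",
    "h": 0.0,                                      # switching cost h_i
    "c": {"rho": 1.0, "good": 1.0, "bad": 1.0},    # play costs c_u
    "r": {"rho": 0.5, "good": 1.0, "bad": 0.0},    # rewards r_u
    "trans": {"rho": {"good": 0.5, "bad": 0.5}, "good": {}, "bad": {}},
    "w": {"rho": 1.0, "good": 0.25, "bad": 0.25},  # candidate w_u
    "z": {"rho": 0.5, "good": 0.0, "bad": 0.0},    # candidate z_u
    "x": {"rho": 0.25, "good": 0.25, "bad": 0.0},  # candidate x_u
}


def lp_feasible(arms, budget, tol=1e-9):
    """Check a candidate (w, x, z) against the constraints described in the proof."""
    total_x = 0.0
    total_cost = 0.0
    for arm in arms:
        w, x, z = arm["w"], arm["x"], arm["z"]
        total_x += sum(x.values())
        total_cost += arm["h"] * z[arm["root"]] + sum(arm["c"][u] * z[u] for u in z)
        for u in w:
            # With x_u + z_u <= w_u below, these bounds keep all variables in [0, 1].
            if min(x[u], z[u]) < -tol or w[u] > 1.0 + tol:
                return False
            # Playing the arm and exploiting it in state u are disjoint events.
            if x[u] + z[u] > w[u] + tol:
                return False
            # A non-root state is reached only through a play in some state v.
            if u != arm["root"]:
                inflow = sum(z[v] * arm["trans"][v].get(u, 0.0) for v in z)
                if abs(inflow - w[u]) > tol:
                    return False
    # One arm exploited in expectation, and expected cost within the budget.
    return total_x <= 1.0 + tol and total_cost <= budget + tol


def lp_objective(arms):
    """Expected exploitation reward: sum of x_u * r_u over all arms and states."""
    return sum(arm["x"][u] * arm["r"][u] for arm in arms for u in arm["x"])
```

On the toy arm the candidate is feasible for a budget of 1 and has objective value 0.375; lowering the budget below the expected cost of 0.5 makes the same candidate infeasible.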

3.2 The Single-arm Policies 

The optimal LP solution clearly does not directly correspond to a feasible policy since the variables 
do not faithfully capture the joint evolution of the states of different arms. Below, we present an 
interpretation of the LP solution, and show how it can be converted to a feasible approximately 
optimal policy. 

Let $(w^*, x^*, z^*)$ denote the optimal solution to the LP. We can assume w.l.o.g. that $w^*_{\rho_i} = 1$
for all $i$. Ignoring the first two constraints of the LP for the time being, the remaining constraints
encode a separate policy for each arm as follows: Consider any arm $i$ in isolation. The play starts
at state $\rho_i$. The arm is played with probability $z^*_{\rho_i}$, so that state $u \in S_i$ is reached with
probability $z^*_{\rho_i} p_{\rho_i u}$. This play incurs cost $h_i + c_{\rho_i}$, which captures the cost of switching
into this arm, and the cost of playing at the root. At state $\rho_i$, with probability $x^*_{\rho_i}$, the play
stops and arm $i$ is chosen for exploitation. The events involving playing the arm and choosing it for
exploitation are disjoint. Similarly, conditioned on reaching state $u \in S_i$, with probabilities
$z^*_u / w^*_u$ and $x^*_u / w^*_u$, arm $i$ is played and chosen for exploitation respectively. This yields a
policy $\phi_i$ for arm $i$ which is described in Figure 1. For policy $\phi_i$, it is easy to see by induction
that if state $u \in S_i$ is reached by the policy with probability $w^*_u$, then state $u \in S_i$ is reached
and arm $i$ is played with probability $z^*_u$.

The policy $\phi_i$ sets $\mathcal{E}_i = 1$ if on termination, arm $i$ was chosen for exploitation. If
$\mathcal{E}_i = 1$ at state $u \in S_i$, then exploiting the arm in this state yields reward $r_u$. Note that
$\mathcal{E}_i$ is a random variable that depends on the execution of policy $\phi_i$. Let $R_i, C_i$ denote the
random variables corresponding to the exploitation reward, and the cost of playing and switching,
respectively.

Policy $\phi_i$: If arm $i$ is currently in state $u$, then choose $q \in [0, w^*_u]$ uniformly at random:

1. If $q \in [0, z^*_u]$, then play the arm (explore).

2. If $q \in (z^*_u, z^*_u + x^*_u]$, then stop executing $\phi_i$, set $\mathcal{E}_i = 1$ (exploit).

3. If $q \in (z^*_u + x^*_u, w^*_u]$, then stop executing $\phi_i$, set $\mathcal{E}_i = 0$.

Figure 1: The Policy $\phi_i$.
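The single-arm policy of Figure 1 can be simulated directly. The sketch below is illustrative only: the function name `run_single_arm_policy`, the state names, and the numerical values in `W`, `Z`, `X`, and `TRANS` are assumptions, chosen so that the reach probabilities are consistent with the plays that lead to them. It draws the threshold $q$ at each visited state and either plays the arm, stops and exploits, or stops without exploiting.

```python
import random

# Hypothetical LP values for one toy arm (illustrative only).
W = {"rho": 1.0, "good": 0.25, "bad": 0.25}
Z = {"rho": 0.5, "good": 0.0, "bad": 0.0}
X = {"rho": 0.25, "good": 0.25, "bad": 0.0}
TRANS = {"rho": {"good": 0.5, "bad": 0.5}, "good": {}, "bad": {}}


def run_single_arm_policy(root, w, x, z, trans, rng):
    """Run the single-arm policy on one arm in isolation.

    Returns (exploit_flag, final_state, plays). Strict inequalities are used
    so that states with z_u = 0 or x_u = 0 never trigger that branch; this
    differs from the closed intervals of Figure 1 only on a measure-zero set.
    """
    state, plays = root, 0
    while True:
        q = rng.uniform(0.0, w[state])
        if q < z[state]:
            # Explore: play the arm and move to a random successor state.
            plays += 1
            successors = list(trans[state])
            weights = [trans[state][s] for s in successors]
            state = rng.choices(successors, weights=weights)[0]
        elif q < z[state] + x[state]:
            return 1, state, plays   # stop and choose the arm for exploitation
        else:
            return 0, state, plays   # stop without exploiting


def exploit_frequency(samples, seed=0):
    """Empirical frequency with which the toy arm is chosen for exploitation."""
    rng = random.Random(seed)
    hits = sum(run_single_arm_policy("rho", W, X, Z, TRANS, rng)[0]
               for _ in range(samples))
    return hits / samples
```

For the toy values, the empirical exploitation frequency should be close to the sum of the `X` entries, which is 0.5.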
For policy $\phi_i$, define the following quantities:

1. $P(\phi_i) = E[\mathcal{E}_i] = \sum_{u \in S_i} \Pr[\mathcal{E}_i = 1 \wedge u] = \sum_{u \in S_i} x^*_u$: Probability the arm is exploited.

2. $R(\phi_i) = E[R_i] = \sum_{u \in S_i} r_u \Pr[\mathcal{E}_i = 1 \wedge u] = \sum_{u \in S_i} x^*_u r_u$: Expected reward of exploitation.

3. $C(\phi_i) = E[C_i] = h_i z^*_{\rho_i} + \sum_{u \in S_i} c_u z^*_u$: Expected cost of switching into and playing this arm.

Let $\phi$ denote the policy that is obtained by executing each $\phi_i$ independently in succession. Since
policy $\phi_i$ is obtained by considering arm $i$ in isolation, $\phi$ is not a feasible policy for the
following reasons: (i) The cost $\sum_i C_i$ spent exploring all the arms need not be at most $C$ in every
exploration trajectory, and (ii) It could happen that for several arms $i$, $\mathcal{E}_i$ is set to 1, which
implies several arms could be chosen simultaneously for exploitation.

However, all is not lost. First note that the r.v. $R_i, C_i, \mathcal{E}_i$ for different $i$ are independent.
Furthermore, it is easy to see using the first two constraints and objective of the LP formulation
that $\phi$ is feasible in the following expected sense: $\sum_i E[C_i] = \sum_i C(\phi_i) \le C$. Secondly,
$\sum_i E[\mathcal{E}_i] = \sum_i P(\phi_i) \le 1$. Finally, $\sum_i E[R_i] = \sum_i R(\phi_i) = \gamma^*$.

Based on the above, we show that policy $\phi$ can be converted to a feasible policy using ideas
from the adaptivity gap proofs for stochastic packing problems [231 [Ml ES]. We treat each policy
$\phi_i$ as an item which takes up cost $C_i$, has size $\mathcal{E}_i$ and profit $R_i$. These items need to be
placed in a knapsack: placing item $i$ corresponds to exploring arm $i$ according to policy $\phi_i$. This
placement is an irrevocable decision, and after the placement, the values of $C_i, \mathcal{E}_i, R_i$ are
revealed. We need that $\sum_i C_i$ for the items placed so far is at most $C$. Furthermore, the placement
(or exploration) stops the first time some $\mathcal{E}_i$ is set to 1, and arm $i$ is used for exploitation
(obtaining reward or profit $R_i$). Since only one $\mathcal{E}_i = 1$ event is allowed before the play stops,
this yields the "size constraint" $\sum_i \mathcal{E}_i \le 1$. The knapsack therefore has both cost and size
constraints, and the goal is to sequentially and irrevocably place the items in the knapsack,
stopping when the constraints would be violated. The goal is to choose the order in which to place
the items so as to maximize the expected profit, or the exploitation gain. This is a two-constraint
stochastic packing problem. The LP solution implies that the expected values of the random variables
satisfy the packing constraints.

We show that the "start-deadline" framework in [22] can be adapted to show that there is a fixed
order of exploring the arms according to the $\phi_i$ which yields gain at least $\gamma^*/4$. There is one
subtle point: the profit (or gain) is also a random variable correlated with the size and cost.
Furthermore, the "start-deadline" model in [22] would also imply the final packing could violate the
constraints by a small amount. We get around this difficulty by presenting an algorithm GreedyOrder
that explicitly obeys the constraints, but whose analysis will be coupled with the analysis of a
simpler policy GreedyViolate which exceeds the budget. The central idea would be that although the
benefit of the current arm has not been "verified", the alternatives have been ruled out.

3.3 The Rounding Algorithm 

The GreedyOrder policy is shown in Figure 2. Note that the policy never revisits an arm, so that
the strategy is sequential. For the purpose of analysis, we first present an infeasible policy
GreedyViolate which is simpler to analyze. The algorithm is the same as GreedyOrder except for
step (2), which we outline in Figure 3.

In GreedyViolate, the cost budget is checked only after fully executing a policy $\phi_j$. Therefore,
the policy could violate the budget constraint by at most the exploration cost $c_{\max}$ of one arm.

Theorem 3.2. GreedyViolate spends cost at most $C + c_{\max}$ and yields reward at least $OPT/4$.

Proof. We have $\gamma^* = \sum_i R(\phi_i)$, and $\sum_i P(\phi_i) \le 1$. We note that the random variables
corresponding to different $i$ are independent.

For notational convenience, let $\nu_i = R(\phi_i)$, and let $\mu_i = P(\phi_i) + C(\phi_i)/C$. We therefore have
$\sum_i \mu_i \le 2$. The sorted ordering is decreasing order of $\nu_i/\mu_i$. Re-number the arms according to
the sorted ordering so that the first arm played is numbered 1. Let $k$ denote the smallest integer
such that $\sum_{i=1}^{k} \mu_i \ge 1$. By the sorted ordering property, it is easy to see that
$\sum_{i=1}^{k} \nu_i \ge \frac{1}{2}\gamma^*$.

Algorithm GreedyOrder

1. Order the arms in decreasing order of $\frac{R(\phi_i)}{P(\phi_i) + C(\phi_i)/C}$ and choose the arms to play in this
order.

2. For each arm $j$ in sorted order, play arm $j$ according to $\phi_j$ as follows until $\phi_j$ terminates:

(a) If the next play according to $\phi_j$ would violate the budget constraint, then stop
exploration and goto step (3).

(b) If $\phi_j$ has terminated and $\mathcal{E}_j = 1$, then stop exploration and goto step (3).

(c) Else, play arm $j$ according to policy $\phi_j$ and goto step (2a).

3. Choose the last arm played in step (2) for exploitation.

Figure 2: The GreedyOrder policy.



Step (2) (GreedyViolate): For each arm $j$ in sorted order, do the following:

(a) Play arm $j$ according to policy $\phi_j$ until $\phi_j$ terminates.

(b) When the policy $\phi_j$ terminates execution, if event $\mathcal{E}_j = 1$ is observed or the cost
budget $C$ is exhausted or exceeded, then stop exploration and goto step (3).

Figure 3: The GreedyViolate policy.
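The ordering rule and the GreedyViolate variant can be sketched at the level of whole single-arm policies. This is a simplified illustration under stated assumptions, not the authors' implementation: each $\phi_j$ is abstracted as a hypothetical zero-argument sampler returning the triple (exploit flag, cost spent, reward of the arm's final state), and the function names `greedy_order` and `greedy_violate` are invented for this sketch.

```python
def greedy_order(P, R, C_exp, budget):
    """Sort arm indices by decreasing R(phi_i) / (P(phi_i) + C(phi_i) / C)."""
    def density(j):
        size = P[j] + C_exp[j] / budget
        return R[j] / size if size > 0 else float("inf")
    return sorted(range(len(P)), key=density, reverse=True)


def greedy_violate(order, samplers, budget):
    """Run each single-arm policy to completion in the given order.

    samplers[j]() is an assumed abstraction of one full execution of phi_j;
    it returns (exploit_flag, cost, final_reward). The budget is checked only
    after a policy terminates, so it can be exceeded by the cost of one arm.
    Returns (reward of the last arm played, total cost spent).
    """
    spent, last_reward = 0.0, 0.0
    for j in order:
        flag, cost, reward = samplers[j]()
        spent += cost
        last_reward = reward          # the last arm played is the one exploited
        if flag == 1 or spent >= budget:
            break
    return last_reward, spent
```

For example, with `P = [0.5, 0.2, 0.3]`, `R = [0.1, 0.4, 0.3]`, unit expected costs, and a budget of 2, the arms are ordered as 1, 2, 0. GreedyOrder differs only in that it checks the budget before every individual play rather than after each policy terminates.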

Arm $i$ is reached and played by the policy iff $\sum_{j<i} \mathcal{E}_j = 0$ and $\sum_{j<i} C_j < C$. This
translates to $\sum_{j<i} \left(\mathcal{E}_j + \frac{C_j}{C}\right) < 1$. Note that
$E\left[\mathcal{E}_j + \frac{C_j}{C}\right] = P(\phi_j) + C(\phi_j)/C = \mu_j$. Therefore, by Markov's inequality,
$\Pr\left[\sum_{j<i} \left(\mathcal{E}_j + \frac{C_j}{C}\right) < 1\right] \ge \max\left(0, 1 - \sum_{j<i} \mu_j\right)$. Note further
that for $i \le k$, we have $\sum_{j<i} \mu_j < 1$.

If arm $i$ is played, it yields expected reward $\nu_i$ that directly contributes to the exploitation
reward. This reward is independent of the event that the arm is reached and played. Therefore, the
expected reward $G$ of GreedyViolate can be bounded by linearity of expectation as follows.

$$\text{Reward of GreedyViolate} = G \ge \sum_{i=1}^{k} \left(1 - \sum_{j<i} \mu_j\right) \nu_i$$

We now follow the proof idea in [22]. Consider the arms $1 \le i \le k$ as deterministic items with item
$i$ having profit $\nu_i$ and size $\mu_i$. We therefore have $\sum_{i=1}^{k} \nu_i \ge \gamma^*/2$ and $\sum_{i=1}^{k} \mu_i \ge 1$.

Suppose these items are placed into a knapsack of size 1 in decreasing order of $\nu_i/\mu_i$ with the
last item possibly being fractionally placed. This is the same ordering that the algorithm uses
to play the arms. Let $\Phi(q)$ denote the profit when the size of the knapsack filled is $q \le 1$. We
have $\Phi(1) \ge \gamma^*/2$. Plot the function $\Phi(q)$ as a function of $q$. This plot connects the points
$\{(0, 0), (\mu_1, \nu_1), (\mu_1 + \mu_2, \nu_1 + \nu_2), \ldots, (1, \Phi(1))\}$. This function is concave, therefore the area
under the curve is at least $\frac{\Phi(1)}{2} \ge \gamma^*/4$. However, the area under this curve is at most

$$\nu_1 + \nu_2 (1 - \mu_1) + \ldots + \nu_k \left(1 - \sum_{j<k} \mu_j\right) \le G$$

Therefore, $G \ge \gamma^*/4$. Since $OPT \le \gamma^*$, $G$ is at least $OPT/4$. □

Theorem 3.3. The GreedyOrder policy with cost budget $C$ achieves reward at least $OPT/4$.



Proof. Consider the GreedyViolate policy. This policy could exceed the cost budget because
the budget was checked only at the end of execution of policy $\phi_i$ for arm $i$. Now suppose the play
for arm $i$ reaches state $u \in S_i$, and the next decision of GreedyViolate involves playing arm
$i$ and this would exceed the cost budget. The GreedyViolate policy continues to play arm $i$
according to $\phi_i$ and when the play is finished, it checks the budget constraint, realizes that the
budget is exhausted, stops, and chooses arm $i$ for exploitation. Suppose the policy was modified so
that instead of the decision to play arm $i$ further at state $u$, the policy instead checks the budget,
realizes it is not sufficient for the next play, stops, and chooses arm $i$ for exploitation. This new
policy is precisely GreedyOrder.

Note now that conditioned on reaching node $u$ with the next decision of GreedyViolate
being to play arm $i$, so that the policies GreedyViolate and GreedyOrder diverge in their
next action, both policies choose arm $i$ for exploitation. By the martingale property of the rewards,
the reward from choosing arm $i$ for exploitation at state $u$ is the same as the expected reward from
playing the arm further and then choosing it for exploitation. Therefore, the expected reward of
both policies is identical, and the theorem follows. □

3.4 Bi-criteria Result 

Suppose we allow the cost budget to be exceeded by a factor $\alpha \ge 1$, so that the cost budget
is $\alpha C$. Consider the GreedyOrder policy where the arms are ordered in decreasing order of
$\frac{R(\phi_i)}{\alpha P(\phi_i) + C(\phi_i)/C}$, and the budget constraint is relaxed to $\alpha C$. We have the following theorem:

Theorem 3.4. For any $\alpha \ge 1$, if the cost budget is relaxed to $\alpha C$, the expected reward of the
modified GreedyOrder policy is at least $\frac{\alpha}{2(1+\alpha)} \gamma^*$.

Proof. We mimic the proof of Theorem 3.2, and define $\nu_i = R(\phi_i)$, and let
$\mu_i = P(\phi_i) + \frac{1}{\alpha} C(\phi_i)/C$. Note that the LP satisfies the constraint
$\sum_i \left( P(\phi_i) + \frac{1}{\alpha} \frac{C(\phi_i)}{C} \right) \le \frac{1+\alpha}{\alpha}$. We therefore have $\sum_i \mu_i \le \frac{1+\alpha}{\alpha}$.
Let $k$ denote the smallest integer such that $\sum_{i=1}^{k} \mu_i \ge 1$. By the sorted ordering property, we
have $\sum_{i=1}^{k} \nu_i \ge \frac{\alpha}{1+\alpha} \gamma^*$. The rest of the proof remains the same, and we show that the
reward of the new policy, $G$, satisfies: $G \ge \frac{1}{2}\Phi(1)$, and $\Phi(1) \ge \frac{\alpha}{1+\alpha} \gamma^*$. This completes the
proof. □
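The arithmetic behind the bi-criteria factor can be written out in one chain. This is only a restatement of the steps above under the definitions $\nu_i = R(\phi_i)$ and $\mu_i = P(\phi_i) + \frac{1}{\alpha} C(\phi_i)/C$, using $\sum_i P(\phi_i) \le 1$ and $\sum_i C(\phi_i) \le C$; it is not an additional result.

```latex
\begin{align*}
\sum_i \mu_i
  &= \sum_i P(\phi_i) + \frac{1}{\alpha C} \sum_i C(\phi_i)
   \le 1 + \frac{1}{\alpha} = \frac{1+\alpha}{\alpha}, \\
\Phi(1)
  &\ge \gamma^* \cdot \min\!\left(1, \frac{1}{\sum_i \mu_i}\right)
   \ge \frac{\alpha}{1+\alpha}\, \gamma^*, \\
G
  &\ge \frac{1}{2}\, \Phi(1)
   \ge \frac{\alpha}{2(1+\alpha)}\, \gamma^*.
\end{align*}
```

Setting $\alpha = 1$ recovers the factor $1/4$ of Theorem 3.2, and the factor approaches $1/2$ as $\alpha$ grows.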

3.5 Integrality Gap of the Linear Program 

We now show via a simple example that the linear program has an integrality gap of at least
$e/(e-1) \approx 1.58$. All arms $i = 1, 2, \ldots, n$ have identical 2-level state spaces. Each $S_i$ has
$c_\rho = 1$, $r_\rho = 1/n$, switching cost $h_i = 0$, and two other states $u_0$ and $u_1$. We have
$p_{\rho u_0} = 1 - 1/n$, $p_{\rho u_1} = 1/n$, $r_{u_0} = 0$, $r_{u_1} = 1$. Set $C = n$, so that any policy can play all
the arms. The expected reward of such a policy is precisely $1 - (1 - 1/n)^n \approx 1 - 1/e$. The LP
solution will set $z^*_\rho = 1$ and $x^*_{u_1} = 1/n$ for all $i$, yielding an LP objective of 1. This shows
that the linear program cannot yield better than a constant factor approximation. It is an
interesting open question whether the LP can be strengthened by other convex constraints to obtain
tighter bounds (see, for instance, [22]).
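The gap in this example is easy to evaluate numerically. The following sketch simply plugs the quantities from the example into two helper functions (the names `best_policy_reward` and `lp_value` are chosen here for illustration) and compares their ratio with $e/(e-1)$.

```python
import math


def best_policy_reward(n):
    """Expected reward of playing all n arms: at least one arm lands in u_1."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def lp_value(n):
    """LP objective with z*_rho = 1 and x*_{u_1} = 1/n for each of the n arms."""
    reward_u1 = 1.0
    return n * (1.0 / n) * reward_u1


def gap(n):
    """Ratio of the LP value to the reward of the policy that plays all arms."""
    return lp_value(n) / best_policy_reward(n)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, gap(n), math.e / (math.e - 1.0))
```

The ratio increases with $n$ and approaches $e/(e-1)$ from below.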

4 Non-adaptive Policies: Bounding the Adaptivity Gap 

Recall that a non-adaptive strategy allocates a fixed budget to each arm in advance. It then explores
the arms according to these budgets (ignoring the outcome of the plays in choosing the next arm
to explore), and at the end of exploration, chooses the best arm for exploitation. This is termed
an allocational strategy in [38]. Such strategies are desirable since they allow the experimenter
to consider various competing arms in parallel. We show two results in this case: For general
state spaces, we show that such a non-adaptive strategy can be arbitrarily worse than the optimal
adaptive strategy. On the positive side, we show that for 2-level state spaces, which correspond to
deterministic underlying rewards (see Section 2), a non-adaptive strategy is only a factor 7 worse
than the performance of the optimal adaptive strategy.

4.1 Lower Bound for Multi-level State Spaces 

We first present an example with unit costs where an adaptive strategy that dynamically allocates 
the budget achieves far better exploitation gain than a non-adaptive strategy. Note that we can 
ignore switching costs in such strategies. 

Theorem 4.1. The adaptivity gap of the budgeted learning problem is $\Omega(\sqrt{n})$. Furthermore, even
if we allow the non-adaptive exploration to use $\gamma > 1$ times the exploration budget, the adaptivity
gap remains $\Omega(\sqrt{n}/\gamma)$.

Proof. Each arm has an underlying reward distribution over the three values $a_1 = 0$, a_2 = 1/n^ and
$a_3 = 1$. Let $q = 1/\sqrt{n}$. The underlying distribution could be one of 3 possibilities: $R_1, R_2, R_3$.
$R_1$ is the deterministic value $a_1$, $R_2$ is deterministically $a_2$, and $R_3$ is $a_3$ w.p. $q$ and $a_2$
w.p. $1 - q$. For each arm, we know in advance that $\Pr[R_1] = 1 - q$, $\Pr[R_2] = q(1 - q)$ and
$\Pr[R_3] = q^2$. Therefore, the knowledge for each arm is a prior over the three distributions
$R_1, R_2, R_3$. The priors for different arms are i.i.d. All $c_i = 1$ and the total budget is $C = 5n$.

We first show that the adaptive policy chooses an arm with underlying reward distribution $R_3$
with constant probability. This policy first plays each arm once and discards all arms with observed
reward $a_1$. With probability at least 1/2, there are at most $2/q$ arms which survive, and at least
one of these arms has underlying reward distribution $R_3$. If more arms survive, choose any $2/q$
arms. The policy now plays each of the $2/q$ arms $2\sqrt{n}$ times. The probability that an arm with
distribution $R_3$ yields reward $a_3$ at least once is $1 - (1 - q)^{2\sqrt{n}} = \Theta(1)$. In this
case, it chooses the arm with reward distribution $R_3$ for exploitation. Since this happens w.p. at
least a constant, the expected exploitation reward is $\Omega(q)$. Note that this is best possible to within
constant factors, since $E[R_3] = \Theta(q)$.
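Two quantities in the description of the adaptive policy are simple to verify numerically: the total number of plays, and the probability that an arm with distribution $R_3$ reveals the value $a_3$ in the second round. The sketch below is a sanity check added for illustration, with function names chosen here; it assumes unit play costs and $q = 1/\sqrt{n}$ as in the proof.

```python
import math


def adaptive_plays(n):
    """Plays used: one per arm, then 2*sqrt(n) plays on each of 2/q = 2*sqrt(n) arms."""
    survivors = 2.0 * math.sqrt(n)          # 2/q with q = 1/sqrt(n)
    plays_per_survivor = 2.0 * math.sqrt(n)
    return n + survivors * plays_per_survivor


def detection_probability(n):
    """Probability that an R_3 arm shows reward a_3 at least once in 2*sqrt(n) plays."""
    q = 1.0 / math.sqrt(n)
    return 1.0 - (1.0 - q) ** (2.0 * math.sqrt(n))
```

The play count equals $5n$, matching the budget $C = 5n$, and the detection probability is at least $1 - e^{-2}$ for every $n \ge 1$, which is the constant hidden in $\Theta(1)$.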

Now consider any non-adaptive policy. With probability $1 - 1/n^{\Theta(1)}$, there are at most $2 \log n$
arms with reward distribution $R_3$, and at least $1/(2q)$ arms with reward distribution $R_2$. Let
$r \gg 2 \log n$. The strategy allocates at most $5r$ plays to at least $n(1 - 1/r)$ arms; call this set of
arms $T$. With probability $(1 - 1/r)^{2 \log n} = \Omega(1 - (2 \log n)/r)$, all arms with reward distribution
$R_3$ lie in this set $T$. For any of these arms played $O(r)$ times, with probability $1 - O(qr)$, all
observed rewards will have value $a_2$. This implies with probability $1 - O(qr)$, all arms with
distribution $R_3$ yield rewards $a_2$, and so do $\Omega(1/(2q))$ arms with distribution $R_2$. Since these
appear indistinguishable to the policy, it can at best choose one of these at random, obtaining
exploitation reward $O(q^2 \log n)$. Since this situation happens with probability $1 - O(\log n / r)$,
and with the remaining probability the exploitation reward is at most $q$, the strategy therefore
has expected exploitation reward $O\left(q \log n \left(\frac{1}{r} + q\right)\right)$. This implies the adaptivity gap is
$\Omega(1/q) = \Omega(\sqrt{n})$ if we set $r = 1/q$.

Now suppose we allow the budget to be increased by a factor of $\gamma > 1$. Then the strategy would
allocate at most $5\gamma r$ plays to at least $n(1 - 1/r)$ arms. By following the same argument as above,
the expected reward is $O\left(q \log n \left(\frac{1}{r} + q\gamma\right)\right)$. This proves the second part of the theorem. □

4.2 Upper Bound for Two-Level State Spaces



We next show that for 2-level state spaces, which correspond to deterministic underlying rewards
(see Section 2), the adaptivity gap is at most a factor of 7.

Theorem 4.2. If each state space $S_i$ is a directed star graph with $\rho_i$ as the root, then there is a
non-adaptive strategy that achieves reward at least 1/7 the LP bound.

Proof. In the case of 2-level state spaces, a non-adaptive strategy chooses a subset S of arms and 
allocates zero/one plays to each of these so that the total cost of the plays is at most C. We 
consider two cases based on the LP optimal solution. 

In the first case, suppose $\sum_i r_{\rho_i} x_{\rho_i} \ge \gamma^*/7$; then not playing anything but simply choosing
the arm with highest $r_{\rho_i}$ directly for exploitation is a 7-approximation.

In the remaining proof, we assume the above is not the case, and compare against the optimal
LP solution that sets $x_{\rho_i} = 0$ for all $i$. This solution has value at least $6\gamma^*/7$. For simplicity of
notation, define $z_i = z_{\rho_i}$ as the probability that the arm $i$ is played. Define
$X_i = \frac{1}{z_i} \sum_{u \in S_i} x_u$ as the probability that the arm is exploited conditioned on being played, and
$R_i = \frac{1}{z_i} \sum_{u \in S_i} x_u r_u$ as the expected exploitation reward conditioned on being played. Also
define $c_i = c_{\rho_i}$. The LP satisfies the constraint $\sum_i z_i \left(\frac{c_i}{C} + X_i\right) \le 2$, and the LP objective is
$\sum_i z_i R_i$, which has value at least $6\gamma^*/7$.

A better objective for the LP can be obtained by considering the arms in decreasing order of
$\frac{R_i}{c_i/C + X_i}$, and increasing $z_i$ in this order until the constraint $\sum_i z_i \left(\frac{c_i}{C} + X_i\right) \le 1$ becomes
tight. Set the remaining $z_i = 0$. It is easy to see $\sum_i z_i R_i \ge \frac{3}{7}\gamma^*$. At this point, let $k$ denote
the index of the last arm which could possibly have $z_k < 1$, and let $S$ denote the set of arms with
$z_i = 1$ for $i \in S$. There are again two cases.

In the first case, if $z_k R_k > \gamma^*/7$, then choosing just this arm for exploitation has reward at least
$\gamma^*/7$, and is a 7-approximation.

In the second and final case, we have a subset of arms with $\sum_{i \in S} \left(\frac{c_i}{C} + X_i\right) \le 1$, and
$\sum_{i \in S} R_i \ge \frac{3}{7}\gamma^* - \gamma^*/7 = \frac{2}{7}\gamma^*$. If all these arms are played, the expected number of arms that
are exploited is $\sum_{i \in S} X_i \le 1$, and the expected reward is $\sum_{i \in S} R_i \ge \frac{2}{7}\gamma^*$. The proof of
Theorem 3.2 can be adapted to show that choosing the best arm for exploitation yields at least half
the reward, i.e., reward at least $\gamma^*/7$. □



5 Lagrangean Version 

Recall from Section 1.3 that in the Lagrangean version of the problem, there are no budget
constraints on the plays; the goal is to find a policy $\pi$ such that $R(\pi) - C(\pi)$ is maximized. Denote
this quantity as the profit of the strategy.

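The linear program relaxation is below. The variables are identical to the previous formulation,
but there is no budget constraint.

$$
\begin{array}{rll}
\text{Maximize} & \sum_{i=1}^{n} \left( \sum_{u \in S_i} (x_u r_u - c_u z_u) - h_i z_{\rho_i} \right) & \\
\text{subject to} & \sum_{i=1}^{n} \sum_{u \in S_i} x_u \le 1 & \\
 & \sum_{v \in S_i} z_v p_{vu} = w_u & \forall i,\ u \in S_i \setminus \{\rho_i\} \\
 & x_u + z_u \le w_u & \forall u \in S_i,\ \forall i \\
 & x_u, z_u, w_u \in [0,1] & \forall u \in S_i,\ \forall i
\end{array}
$$

Let $OPT$ = optimal net profit and $\gamma^*$ = optimal LP solution. The next claim is similar to Claim 3.1.

Claim 5.1. $OPT \le \gamma^*$.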

From this LP optimum $(w^*, x^*, z^*)$, the policy $\phi_i$ is constructed as described in Figure 1, and the
r.v.'s $\mathcal{E}_i, C_i, R_i$ and their respective expectations $P(\phi_i)$, $C(\phi_i)$, and $R(\phi_i)$ are obtained as
described in the beginning of Section 3.2. Let r.v. $Y_i = R_i - C_i$ denote the profit of playing arm $i$
according to $\phi_i$. Note that $E[Y_i] = \sum_{u \in S_i} (x^*_u r_u - c_u z^*_u) - h_i z^*_{\rho_i}$.

The nice aspect of the proof of Theorem 3.2 is that it does not necessarily require the r.v.
corresponding to the reward of policy $\phi_i$, $R_i$, to be non-negative. As long as $E[R_i] = R(\phi_i) \ge 0$,
the proof holds. This will be crucial for the Lagrangean version.

Claim 5.2. For any arm $i$, $E[Y_i] = R(\phi_i) - C(\phi_i) \ge 0$.

Proof. For each $i$, since all $r_u \ge 0$, setting $x_{\rho_i} \leftarrow \sum_{u \in S_i} x_u$, $w_{\rho_i} \leftarrow 1$, and $z_u \leftarrow 0$ for
$u \in S_i$ yields a feasible non-negative solution. The LP optimum will therefore guarantee that the term
$\sum_{u \in S_i} (x_u r_u - c_u z_u) - h_i z_{\rho_i} \ge 0$. Therefore, $E[Y_i] \ge 0$ for all $i$. □

The GreedyOrder policy orders the arms in decreasing order of $\frac{R(\phi_i) - C(\phi_i)}{P(\phi_i)}$, and plays them
according to their respective $\phi_i$ until some $\mathcal{E}_i = 1$.

Theorem 5.3. The expected profit of GreedyOrder is at least $OPT/2$.

Proof. Let $\mu_i = P(\phi_i)$ and $\nu_i = E[Y_i]$ for notational convenience. The LP solution yields
$\sum_i \mu_i \le 1$ and $\sum_i \nu_i = \gamma^*$. Re-number the arms according to the sorted ordering of $\nu_i/\mu_i$ so
that the first arm played is numbered 1.

The event that GreedyOrder plays arm $i$ corresponds to $\sum_{j<i} \mathcal{E}_j = 0$. By Markov's inequality,
we have $\Pr\left[\sum_{j<i} \mathcal{E}_j = 0\right] = \Pr\left[\sum_{j<i} \mathcal{E}_j < 1\right] \ge 1 - \sum_{j<i} \mu_j$.

If arm $i$ is played, it yields profit $Y_i$. This implies the profit of GreedyOrder is
$\sum_i Y_i \left(1 - \sum_{j<i} \mathcal{E}_j\right)$. Since $Y_i$ is independent of $\sum_{j<i} \mathcal{E}_j$, and since Claim 5.2 implies
$E[Y_i] \ge 0$, the expected profit $G$ of GreedyOrder can be bounded by linearity of expectation as
follows.

$$G \ge \sum_i \left(1 - \sum_{j<i} \mu_j\right) E[Y_i] = \sum_i \nu_i \left(1 - \sum_{j<i} \mu_j\right)$$

We now follow the proof idea in [22]. Consider the arms $1 \le i \le n$ as deterministic items with item
$i$ having profit $\nu_i$ and size $\mu_i$. We therefore have $\sum_i \nu_i \ge \gamma^*$ and $\sum_i \mu_i \le 1$. Using the same
proof idea as in Theorem 3.2, it is easy to see that $G \ge \frac{\gamma^*}{2}$. Since $OPT \le \gamma^*$, $G$ is at least
$\frac{OPT}{2}$. □



6 Concave Utility Functions 

The above framework in fact solves the more general problem of maximizing any concave stochastic
objective function over the rewards of the arms subject to a (deterministic) packing constraint.
Several examples of such concave objective functions are given in [37] in the context of optimizing
"TCP friendly" network utility functions. In what follows, we extend our arguments in the previous
section to develop approximation algorithms for all positive concave utility maximization problems
in this exploration-exploitation setting. Suppose arm $i$ in state $u \in S_i$ has a value function $g_u(y)$
where $y \in [0, 1]$ denotes the weight assigned to it in the exploitation phase. We enforce the following
properties on the function $g_u(y)$.

Concavity. $g_u(y)$ is an arbitrary positive non-decreasing concave function of $y$.

Super-Martingale. $g_u(y) \ge \sum_{v \in S_i} p_{uv}\, g_v(y)$.

Given an outcome $o \in O(\pi)$ of exploration, suppose arm $i$ ends up in state $u$, and is assigned
weight $y_i$ in the exploitation phase; the contribution of this arm to the exploitation value is
$g_u(y_i)$. The assignment of weights is subject to a deterministic packing constraint
$\sum_i \sigma_i y_i \le B$, where $\sigma_i \in [0, B]$. Therefore, for a given outcome $o \in O(\pi)$, the value of this
outcome is given by the convex program below, where $u$ denotes the state in which arm $i$ ends up:

$$\max \sum_{i=1}^{n} g_u(y_i) \quad \text{s.t.} \quad \sum_{i=1}^{n} \sigma_i y_i \le B, \qquad y_i \in [0,1]\ \ \forall i$$

The goal as before is to design an adaptive exploration phase $\pi$ so that the expected exploitation
value is maximized, where the expectation is over the outcomes $O(\pi)$ of exploration and the cost of
exploration is at most $C$.

• For the maximum reward problem, $g_u(y) = r_u y$, $\sigma_i = 1$, and $B = 1$.

• Suppose we wish to choose the $m$ best rewards; we simply set $B = m$. Note that we can also
conceive of a scenario where the $c_i$ correspond to the cost of "pilot studies" and each treatment
$i$ requires cost $\sigma_i$ for large scale studies. This would lead us to a Knapsack type problem
where the $\sigma_i$ are now the "sizes".
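For the linear special case $g_u(y) = r_u y$ used in these examples, the convex program for a fixed outcome reduces to a fractional knapsack, which a greedy rule solves. The sketch below covers only this special case, not general concave $g_u$; the function name `exploit_value_linear` is an assumption, and all sizes $\sigma_i$ are assumed to be strictly positive.

```python
def exploit_value_linear(r, sigma, B):
    """Maximize sum(r[i] * y[i]) subject to sum(sigma[i] * y[i]) <= B, 0 <= y[i] <= 1.

    r[i] is the reward of the final state of arm i, sigma[i] > 0 its size.
    Greedy by reward per unit size is optimal for this linear objective.
    Returns (value, y).
    """
    order = sorted(range(len(r)), key=lambda i: r[i] / sigma[i], reverse=True)
    remaining, value = B, 0.0
    y = [0.0] * len(r)
    for i in order:
        if remaining <= 0.0:
            break
        y[i] = min(1.0, remaining / sigma[i])   # last item may be fractional
        value += r[i] * y[i]
        remaining -= sigma[i] * y[i]
    return value, y
```

With unit sizes and $B = 1$ this picks the single best arm, and with $B = m$ it picks the $m$ best arms, matching the two examples above.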

6.1 Linear Program 

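The state space $S_i$ and the probabilities $p_{uv}$ are defined just as in Section 1.3. For small constant
$\epsilon > 0$, let $L = n/\epsilon$. Discretize the domain $[0,1]$ in multiples of $1/L$. For $l \in \{0, 1, \ldots, L\}$, let
$\zeta_u(l) = g_u(l/L)$. This corresponds to the contribution of arm $i$ to the exploitation value on
allocating weight $y_i = l/L$. Define the following linear program:

$$
\begin{array}{rll}
\text{Max} & \sum_{i=1}^{n} \sum_{u \in S_i} \sum_{l=0}^{L} x_{ul}\, \zeta_u(l) & \\
\text{subject to} & \sum_{i=1}^{n} \left( h_i z_{\rho_i} + \sum_{u \in S_i} c_u z_u \right) \le C & \\
 & \sum_{i=1}^{n} \sigma_i \sum_{u \in S_i} \sum_{l=0}^{L} \frac{l}{L}\, x_{ul} \le B(1+\epsilon) & \\
 & \sum_{v : u \in D(v)} z_v p_{vu} = w_u & \forall i,\ u \in S_i \setminus \{\rho_i\} \\
 & z_u + \sum_{l=0}^{L} x_{ul} \le w_u & \forall u \in S_i,\ \forall i \\
 & w_u, x_{ul}, z_u \in [0,1] & \forall u \in S_i,\ \forall i,\ l
\end{array}
$$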

Let $\gamma^*$ be the optimal LP value and $OPT$ = value of the optimal adaptive exploration policy.
Lemma 6.1. $OPT \le \gamma^*$.

Proof. In the optimal solution, let $w_u$ denote the probability that the policy reaches state $u \in S_i$,
and let $z_u$ denote the probability of reaching state $u \in S_i$ and playing arm $i$ in this state. For
$l \ge 1$, let $x_{ul}$ denote the probability of stopping exploration at $u \in S_i$ and allocating weight
$y_i \in \left(\frac{l-1}{L}, \frac{l}{L}\right]$ to arm $i$. All the constraints are straightforward, except the constraint
involving $B$. Observe that if the weight assignments $y_i$ in the optimal solution were rounded up to
the nearest multiple of $1/L$, then the total size of any assignment increases by at most $\epsilon B$ since
all $\sigma_i \le B$. Therefore, this constraint is satisfied. Using the same rounding up argument, if the
weight satisfies $y_i \in \left(\frac{l-1}{L}, \frac{l}{L}\right]$, then the contribution of arm $i$ to the exploitation value is
upper bounded by $\zeta_u(l)$ since the function $g_u(y)$ is non-decreasing in $y$. Therefore, the proof
follows. □

Policy $\phi_i$: If arm $i$ is currently in state $u$, choose $q \in [0, w^*_u]$ u.a.r. and do one of the
following:

1. If $q \in [0, z^*_u]$, then play the arm.

2. Else, stop executing $\phi_i$. Find the smallest $l \ge 0$ such that $q \le z^*_u + \sum_{k=0}^{l} x^*_{uk}$.
Set $\mathcal{E}_i = l/L$ and $R_i = \zeta_u(l)$.

Figure 4: The policy $\phi_i$ for concave value functions.



6.2 Exploration Policy 

Let $(w^*_u, x^*_{ul}, z^*_u)$ denote the optimal solution to the LP. Assume $w^*_{\rho_i} = 1$ for all $i$. Also
w.l.o.g., $z^*_u + \sum_{l=0}^{L} x^*_{ul} = w^*_u$ for all $u \in S_i$. The LP solution yields a natural (infeasible)
exploration policy $\phi$ consisting of one independent policy $\phi_i$ per arm $i$. Policy $\phi_i$ is described
in Figure 4.

The policy $\phi_i$ is independent of the states of the other arms. It is easy to see by induction that
if state $u \in S_i$ is reached by the policy with probability $w^*_u$, then state $u \in S_i$ is reached and arm
$i$ is played with probability $z^*_u$. Let random variable $C_i$ denote the cost of executing $\phi_i$, and let
$C(\phi_i) = E[C_i]$. Denote this overall policy $\phi$; this corresponds to one independent decision policy
$\phi_i$ (determined by $(w^*_u, x^*_{ul}, z^*_u)$) per arm. It is easy to see that the following hold for $\phi$:

1. $C(\phi_i) = E[C_i] = h_i z^*_{\rho_i} + \sum_{u \in S_i} c_u z^*_u$, so that $\sum_i C(\phi_i) \le C$.

2. $P(\phi_i) = E[\mathcal{E}_i] = \frac{1}{L} \sum_{u \in S_i} \sum_{l=0}^{L} l\, x^*_{ul}$, so that $\sum_i \sigma_i P(\phi_i) \le B(1 + \epsilon)$.

3. $R(\phi_i) = E[R_i] = \sum_{u \in S_i} \sum_{l=0}^{L} x^*_{ul}\, \zeta_u(l)$, so that $\sum_i R(\phi_i) = \gamma^*$.



Algorithm GreedyOrder

1. Order the arms in decreasing order of $\frac{R(\phi_i)}{\frac{\sigma_i}{B} P(\phi_i) + \frac{1}{C} C(\phi_i)}$.

2. For each arm $j$ in sorted order, play it according to $\phi_j$ as follows until $\phi_j$ terminates:

(a) If the next play would violate the cost constraint, then set $\mathcal{E}_j \leftarrow 1$, stop
exploration, and goto step (3).

(b) If $\phi_j$ terminates and $\sum_i \sigma_i \mathcal{E}_i \ge B$, then stop exploration and goto step (3).

(c) Else, play arm $j$ according to policy $\phi_j$ and goto step (2a).

3. Exploitation: Scale down $\mathcal{E}_i$ by a factor of 2.

Figure 5: The GreedyOrder policy for concave functions.

The GreedyOrder policy is presented in Figure 5. We again use an infeasible policy GreedyViolate
which is simpler to analyze. The algorithm is the same as GreedyOrder except for step
(2), where violation of the cost constraint is only checked after the policy $\phi_j$ terminates.

Theorem 6.2. Let $c_{\max}$ denote the maximum cost of exploring a single arm. Then GreedyViolate
spends cost at most $C + c_{\max}$ and has expected value at least $\frac{OPT}{8}(1 - \epsilon)$.

Proof. Let $\nu_i = R(\phi_i)$ and let $\mu_i = \frac{\sigma_i}{B} P(\phi_i) + \frac{1}{C} C(\phi_i)$. The LP constraints imply that
$\gamma^* = \sum_i \nu_i$, and $\sum_i \mu_i \le 2 + \epsilon$. Now using the same proof as Theorem 3.2, we obtain that the
value $G$ of GreedyViolate according to the weight assignment $\mathcal{E}_i$ at the end of Step (2) is at least
$\frac{OPT}{4}(1 - \epsilon)$. This weight assignment could be infeasible because of the last arm, so that the
$\mathcal{E}_i$ only satisfy $\sum_i \sigma_i \mathcal{E}_i \le 2B$. This is made feasible in Step (3) by scaling all $\mathcal{E}_i$ down by a
factor of 2. Since the functions $g_u(y)$ are concave in $y$, the exploitation value reduces by at most a
factor of 2 because of scaling down. □

Theorem 6.3. The GreedyOrder policy with budget $C$ achieves expected value at least
$\frac{OPT}{8}(1 - \epsilon)$.

Proof. Consider the GreedyViolate policy. Now suppose the play for arm $i$ reaches state $u \in S_i$,
and the next decision of GreedyViolate involves playing arm $i$ and this would exceed the cost
budget. Conditioned on this next decision, GreedyOrder sets $\mathcal{E}_i = 1$ and stops exploration. In
this case, the exploitation value of GreedyOrder from arm $i$ is at least the expected exploitation
gain of GreedyViolate for this arm by the super-martingale property of the value function $g$.
Therefore, for the assignments at the end of Step (2), the gain of GreedyOrder is at least
$\frac{OPT}{4}(1 - \epsilon)$. Since Step (3) scales the $\mathcal{E}_i$'s down by a factor of 2, the theorem follows. □



7 Conclusions 

We studied the classical stochastic multi-armed bandit problem under the future utilization
objective in the presence of priors. This model is relevant to settings involving data acquisition and
design of experiments. In this problem the exploration phase necessarily precedes the exploitation 
phase. This makes the problem significantly different from the problems in online optimization, 
which seeks to minimize regret over the past, because online optimization models problems where 
exploration and exploitation are simultaneous. The central difficulty of online optimization is the 
lack of information, whereas the difficulty in optimizing future utilization is computational. In 
fact the latter is provably NP-Hard. We presented constant factor approximation algorithms that 
yield sequential policies for several extensions of this basic problem. These algorithms proceed via 
LP rounding and show a surprising connection to stochastic packing algorithms. We also show that
the sequential policy we develop is within a constant factor of a fully adaptive solution. Note that
a constant factor adaptivity gap result is the best possible.

There are several challenging open questions arising from this work; we mention two of them. 
First, we conjecture that constructing a (possibly adaptive) strategy for the budgeted learning 
problem is APX-Hard, i.e., there exists an absolute constant c > 1 such that it is NP-Hard 
to produce a solution which is within factor c times the optimum. Secondly, we have focused 
exclusively on utility maximization; it would be interesting to explore other objectives, such as 
minimizing residual information [35]. 

Acknowledgment: We would like to thank Jen Burge, Vincent Conitzer, Ashish Goel, Ronald
Parr, and Fernando Pereira for helpful discussions.